ON UNIVERSAL ORACLE INEQUALITIES RELATED TO
By Yuri Golubev

CNRS, Universite de Provence 

This paper deals with recovering an unknown vector $\theta$ from the
noisy data $Y = A\theta + \sigma\xi$, where $A$ is a known $(m \times n)$-matrix
and $\xi$ is a white Gaussian noise. It is assumed that $n$ is large and $A$
may be severely ill-posed. Therefore, in order to estimate $\theta$, a spectral
regularization method is used, and our goal is to choose its regularization
parameter with the help of the data $Y$. For spectral regularization
methods related to the so-called ordered smoothers [see Kneip Ann.
Statist. 22 (1994) 835-866], we propose new penalties in the principle
of empirical risk minimization. The heuristic idea behind these
penalties is related to balancing excess risks. Based on this approach,
we derive a sharp oracle inequality controlling the mean square risks
of data-driven spectral regularization methods.

1. Introduction and main results. In this paper, we consider a classical
problem of recovering an unknown vector
$\theta = (\theta(1), \ldots, \theta(n))^T \in \mathbb{R}^n$ in the standard linear model

$$Y = A\theta + \sigma\xi, \tag{1.1}$$

where $A$ is a known $(m \times n)$-matrix and $\xi = (\xi(1), \ldots, \xi(m))^T$ is a
standard white Gaussian noise in $\mathbb{R}^m$ with $E\xi(k) = 0$, $E\xi^2(k) = 1$,
$k = 1, \ldots, m$. The noise level $\sigma$ in (1.1) is assumed to be known.

We start out by considering the maximum likelihood estimate of $\theta$,

$$\hat\theta_0 = \arg\min_{\theta} \|Y - A\theta\|^2,$$

where $\|x\|^2 = \sum_k x^2(k)$. It is easily seen that
$\hat\theta_0 = (A^TA)^{-1}A^TY$ and that the mean square risk of this estimator
is computed as follows:

$$E\|\hat\theta_0 - \theta\|^2 = \sigma^2 E\|(A^TA)^{-1}A^T\xi\|^2 = \sigma^2\,\mathrm{trace}[(A^TA)^{-1}] = \sigma^2\sum_{k=1}^n \lambda^{-1}(k), \tag{1.2}$$

where $\lambda(k)$ and $\psi_k \in \mathbb{R}^n$ are the eigenvalues and eigenvectors of $A^TA$,

$$A^TA\psi_k = \lambda(k)\psi_k.$$

In this paper, it is assumed solely that
$\lambda(1) \ge \lambda(2) \ge \cdots \ge \lambda(n)$. So, $A$ may be severely ill-posed
and (1.2) reveals the principal difficulty in $\hat\theta_0$: its risk may be very
large when $n$ is large or when $A$ has a large condition number.
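
As a rough numerical illustration of (1.2), the following Python sketch evaluates $\sigma^2\sum_k \lambda^{-1}(k)$ for two spectra. The decay laws $\lambda(k) = k^{-2}$ and $\lambda(k) = e^{-k}$, the dimension and the noise level are assumptions chosen only for illustration; they are not taken from the paper.

```python
import math

def mle_risk(eigenvalues, sigma):
    """Risk (1.2) of the maximum likelihood estimate: sigma^2 * sum_k 1/lambda(k)."""
    return sigma ** 2 * sum(1.0 / lam for lam in eigenvalues)

n, sigma = 30, 0.01  # illustrative values only
mild = [k ** -2.0 for k in range(1, n + 1)]        # polynomial decay (assumed)
severe = [math.exp(-k) for k in range(1, n + 1)]   # exponential decay (assumed)

risk_mild = mle_risk(mild, sigma)
risk_severe = mle_risk(severe, sigma)
print(risk_mild, risk_severe)
```

Even with a small noise level, the exponentially decaying spectrum produces a risk that is many orders of magnitude larger than the polynomial one.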

The simplest way to improve $\hat\theta_0$ is to suppress large $\lambda^{-1}(k)$ in (1.2)
with the help of a linear smoother; that is, to estimate $\theta$ by $H\hat\theta_0$,
where $H$ is a properly chosen $(n \times n)$-matrix. In what follows, we deal
with smoothing matrices admitting the representation $H = H_\alpha(A^TA)$,
where $H_\alpha(\lambda)$ is a function $\mathbb{R}^+ \to [0,1]$ which depends on a
regularization parameter $\alpha \in [0, \bar\alpha]$ such that

$$\lim_{\alpha\to 0} H_\alpha(\lambda) = 1, \qquad \lim_{\lambda\to 0} H_\alpha(\lambda) = 0.$$

This method is called spectral regularization [see Engl, Hanke and Neubauer
(1996)] since $A^TA$ and $H_\alpha(A^TA)$ have the same eigenvectors. Summarizing,
we estimate $\theta$ with the help of the following family of linear estimators

$$\hat\theta_\alpha = H_\alpha(A^TA)(A^TA)^{-1}A^TY,$$

and our main goal is to choose the best estimator within this family, or
equivalently, the best regularization parameter $\alpha$. Note that $\alpha$ controls
the mean square risk of $\hat\theta_\alpha$,

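$$L_\alpha(\theta) = E\|\hat\theta_\alpha - \theta\|^2 = \sum_{k=1}^n [1 - h_\alpha(k)]^2 \langle\theta,\psi_k\rangle^2 + \sigma^2\sum_{k=1}^n \lambda^{-1}(k)h_\alpha^2(k), \tag{1.3}$$

where here and below we denote for brevity

$$h_\alpha(k) \stackrel{\mathrm{def}}{=} H_\alpha[\lambda(k)] \quad\text{and}\quad \langle\theta,\psi_k\rangle = \sum_{l=1}^n \theta(l)\psi_k(l).$$

According to (1.3), the variance of $\hat\theta_\alpha$ is always smaller than that of
the maximum likelihood estimate, but $\hat\theta_\alpha$ has a nonzero bias, and by
adjusting $\alpha$ properly we may improve on $\hat\theta_0$. Note that this improvement
may be significant if $\langle\theta,\psi_k\rangle^2$ are small for large $k$.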
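
The bias-variance trade-off in (1.3) can be sketched numerically as follows. This is a minimal illustration that works directly in the eigenbasis of $A^TA$; the spectrum, the coefficients $\langle\theta,\psi_k\rangle$ and the cut-off level are hypothetical values, not data from the paper.

```python
def spectral_risk(h, theta, eigenvalues, sigma):
    """Risk (1.3): squared bias plus variance, all in the eigenbasis of A^T A."""
    bias2 = sum((1.0 - hk) ** 2 * tk ** 2 for hk, tk in zip(h, theta))
    variance = sigma ** 2 * sum(hk ** 2 / lam for hk, lam in zip(h, eigenvalues))
    return bias2 + variance

n, sigma = 20, 0.05
eigenvalues = [k ** -2.0 for k in range(1, n + 1)]   # assumed spectrum
theta = [1.0 / k ** 2 for k in range(1, n + 1)]      # assumed coefficients <theta, psi_k>

no_smoothing = [1.0] * n                              # h = 1 reproduces the MLE
cutoff = [1.0 if k <= 5 else 0.0 for k in range(1, n + 1)]  # keep 5 components

risk_mle = spectral_risk(no_smoothing, theta, eigenvalues, sigma)
risk_cut = spectral_risk(cutoff, theta, eigenvalues, sigma)
print(risk_mle, risk_cut)
```

With these values the cut-off accepts a small bias in exchange for a much smaller variance, so its risk is well below that of the unsmoothed estimate.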

In practice, a good choice of $H_\alpha(\cdot)$ is a delicate problem related to the
numerical complexity of $\hat\theta_\alpha$. For instance, to make use of the spectral
cut-off with $H_\alpha(\lambda) = \mathbf{1}\{\lambda \ge \alpha\}$, one has to compute the singular
value decomposition (SVD) of $A$. For very large $n$, this numerical problem
may be difficult or even infeasible.

The very popular Tikhonov-Phillips regularization [see, e.g., Tikhonov and
Arsenin (1977)]

$$\hat\theta_\alpha = \arg\min_{\theta}\{\|Y - A\theta\|^2 + \alpha\|\theta\|^2\}$$

does not require SVD. In this case, $\hat\theta_\alpha$ is computed as a root of the linear
equation

$$(\alpha I + A^TA)\hat\theta_\alpha = A^TY$$

and therefore $H_\alpha(\lambda) = \lambda/(\lambda + \alpha)$. It is worth pointing out that this
regularization technique is good solely for ill-posed $A$.

Another widespread regularization technique is due to Landweber (1951).
This method is based on a very simple idea: to compute recursively a root
of the equation

$$A^TA\theta = A^TY.$$

Since $A^TY = [A^TA - aI]\theta + a\theta$ for all $a > 0$, we get
$\theta = [I - a^{-1}A^TA]\theta + a^{-1}A^TY$. This formula motivates Landweber's
iterations defined by

$$\hat\theta_k = [I - a^{-1}A^TA]\hat\theta_{k-1} + a^{-1}A^TY.$$

Thus, we can estimate $\theta$ without computing SVD and without solving linear
equations. It is easily seen that these iterations converge if $\lambda(1) < a$ and
that the corresponding spectral regularization function is given by

$$H_k(\lambda) = 1 - \Big(1 - \frac{\lambda}{a}\Big)^k. \tag{1.4}$$
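
The closed form (1.4) can be checked against the recursion itself. The sketch below runs the Landweber step along a single eigen-direction, where the recursion reduces to a scalar one; the starting value $\hat\theta_0 = 0$ and the numerical values of $\lambda$, $a$ and $k$ are assumptions made for this illustration.

```python
def landweber_filter_by_iteration(lam, a, k):
    """Run k Landweber steps along one eigen-direction with unit data,
    started from 0 (assumed start); the result is the filter value H_k(lam)."""
    x = 0.0
    for _ in range(k):
        x = (1.0 - lam / a) * x + lam / a  # scalar form of the recursion
    return x

def landweber_filter_closed_form(lam, a, k):
    """Closed form (1.4): H_k(lam) = 1 - (1 - lam / a)^k."""
    return 1.0 - (1.0 - lam / a) ** k

a = 1.0  # illustrative step parameter with lam < a
gaps = [
    abs(landweber_filter_by_iteration(lam, a, k) - landweber_filter_closed_form(lam, a, k))
    for lam in (0.9, 0.5, 0.01)
    for k in (1, 5, 50)
]
print(max(gaps))
```

For small $\lambda$ the filter stays far from 1 unless $k$ is large, which is the numerical cost discussed next.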

The regularization parameter of the Landweber method is usually defined
by $\alpha = 1/k$. Note that in spite of its iterative character, the numerical
complexity of the Landweber method may be high. Indeed, when the noise is
very small, $H_k(\lambda)$ should be close to 1, and (1.4) implies that

$$k \gg \mathrm{cond}(A) \stackrel{\mathrm{def}}{=} \frac{\lambda(1)}{\lambda(n)}.$$

This means that if $A$ is severely ill-posed, the number of iterations may be
very large, thus making the method infeasible. A substantial improvement
of Landweber's iterations is provided by the so-called $\nu$-method [see, e.g.,
Engl, Hanke and Neubauer (1996) and Bissantz et al. (2007)].

All the above-mentioned regularization methods are particular cases of
the so-called ordered smoothers [see Kneip (1994)] defined as follows.

Definition 1. The family of sequences
$\{h_\alpha(k), \alpha \in (0,\bar\alpha], k \in \mathbb{N}^+\}$ is called an ordered smoother if:

1. For any given $\alpha \in (0,\bar\alpha]$, $h_\alpha(k): \mathbb{N}^+ \to [0,1]$ is a monotone function of $k$.

2. If for some $\alpha_1, \alpha_2 \in (0,\bar\alpha]$ and $k' \in \mathbb{N}^+$, $h_{\alpha_1}(k') < h_{\alpha_2}(k')$,
then $h_{\alpha_1}(k) \le h_{\alpha_2}(k)$ for all $k \in \mathbb{N}^+$.
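
The two conditions of Definition 1 can be tested on a finite grid. The helper below is a hypothetical utility written for this illustration (it is not part of the paper): it receives one sequence per value of $\alpha$ and checks monotonicity in $k$ together with the absence of crossings between any two sequences.

```python
def is_ordered_smoother(family, tol=1e-12):
    """family: list of sequences h_alpha(1..n), one per alpha on a finite grid.
    Returns True when both conditions of Definition 1 hold on that grid."""
    for h in family:
        if any(v < -tol or v > 1.0 + tol for v in h):
            return False                      # values must lie in [0, 1]
        nonincreasing = all(h[i] >= h[i + 1] - tol for i in range(len(h) - 1))
        nondecreasing = all(h[i] <= h[i + 1] + tol for i in range(len(h) - 1))
        if not (nonincreasing or nondecreasing):
            return False                      # condition 1 fails
    for g in family:
        for h in family:
            below = any(gk < hk - tol for gk, hk in zip(g, h))
            above = any(gk > hk + tol for gk, hk in zip(g, h))
            if below and above:
                return False                  # the two sequences cross: condition 2 fails
    return True

lam = [k ** -2.0 for k in range(1, 11)]       # assumed spectrum
tikhonov = [[l / (l + alpha) for l in lam] for alpha in (0.01, 0.1, 1.0)]
crossing = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
print(is_ordered_smoother(tikhonov), is_ordered_smoother(crossing))
```

The Tikhonov-Phillips filters pass the check, while the second family fails because a cut-off sequence and a constant sequence cross.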

It was Kneip who noted that from a probabilistic viewpoint, all ordered
smoothers are equivalent to the spectral cut-off with $h_\alpha(k) = \mathbf{1}\{\lambda(k) \ge \alpha\}$.
This profound fact plays an essential role in adaptive estimation since it helps 
to analyze precisely statistical risks of feasible data-driven regularization 
methods. This is why in this paper we deal solely with the ordered smoothers. 

Whatever inversion method is used, the principal question usually arising
in practice is how to choose its regularization parameter. The traditional
theoretical approach to this problem is related to the minimax theory; see, for
example, Mair and Ruymgaart (1996) and O'Sullivan (1986). However, this
approach provides smoothing parameters depending strongly on a priori
information about $\theta$ which is hardly available in practice. The only
way to overcome this drawback is to use data-driven regularizations. In
the statistical literature, one can find several general approaches for
constructing such methods. We cite here, for instance, the Lepski method,
which has been adapted to inverse problems in Mathé (2006), Bauer and
Hohage (2005), Bissantz et al. (2007), and the model selection technique
which was implemented in Loubes and Ludeña (2008).

In this paper, we take the classical way related to the famous principle of
unbiased risk estimation which goes back to Akaike (1973). The heuristic
motivation of this approach is based on the idea that a good data-driven
regularization should minimize in some sense the risk $L_\alpha(\theta)$ [see (1.3)].
This idea is put into practice with the help of the empirical risk minimization
suggesting to compute data-driven regularization parameters as follows:

$$\hat\alpha = \arg\min_{\alpha\in(0,\bar\alpha]} R_\alpha[Y,\mathrm{Pen}], \tag{1.5}$$

where

$$R_\alpha[Y,\mathrm{Pen}] = \|\hat\theta_0 - \hat\theta_\alpha\|^2 + \sigma^2\mathrm{Pen}(\alpha),$$

and $\mathrm{Pen}(\alpha): (0,\bar\alpha] \to \mathbb{R}^+$ is a given penalty function. The most
important problem in this approach is related to the choice of the penalty.
Intuitively, we want the method to mimic the oracle smoothing parameter
$\alpha^* = \arg\min_\alpha L_\alpha(\theta)$. This is why we are looking for a minimal penalty
that ensures the following inequality:



$$L_\alpha(\theta) \le R_\alpha[Y,\mathrm{Pen}] + C, \tag{1.6}$$

where $C$ is a random variable that does not depend on $\alpha$. It is easily seen
that in the considered statistical model,

$$C = -\|\theta - \hat\theta_0\|^2 = -\sigma^2\sum_{k=1}^n \lambda^{-1}(k)\xi^2(k).$$

The traditional approach to solving (1.6) is based on the unbiased risk
estimation defining the penalty as a root of the equation

$$L_\alpha(\theta) = ER_\alpha[Y,\mathrm{Pen}] + EC.$$

Unfortunately, in spite of its very natural motivation, this penalty is not good
for ill-posed problems [see Cavalier and Golubev (2006) for more details].

The main idea in this paper is to compute the penalty in a slightly
different way, namely as a minimal function assuring the following inequality:

$$E\sup_{\alpha\le\bar\alpha}[L_\alpha(\theta) - R_\alpha[Y,\mathrm{Pen}] - C]_+ \le KE[L_{\bar\alpha}(\theta) - R_{\bar\alpha}[Y,\mathrm{Pen}] - C]_+, \tag{1.7}$$

where $[x]_+ = \max\{0, x\}$ and $K > 1$ is a constant. The heuristic motivation
behind this approach is rather transparent: we are looking for a minimal
penalty that balances all excess risks uniformly in $\alpha \in (0,\bar\alpha]$. Recall that
the excess risk is defined as the difference between the risk of the estimate and
its empirical risk. Note that according to (1.6), we may focus on the positive
part of the excess risk, and that equation (1.7) guarantees that for any
data-driven smoothing parameter $\hat\alpha$

$$E[L_{\hat\alpha}(\theta) - R_{\hat\alpha}[Y,\mathrm{Pen}] - C]_+ \le KE[L_{\bar\alpha}(\theta) - R_{\bar\alpha}[Y,\mathrm{Pen}] - C]_+.$$

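In order to explain how one can compute good penalties assuring (1.7), we
begin with the spectral representation of the underlying statistical problem.
We can check easily that

$$y(k) = \langle Y, A\psi_k\rangle\lambda^{-1}(k) = \langle\theta,\psi_k\rangle + \sigma\lambda^{-1/2}(k)\xi(k),$$

where $\xi(k)$ are i.i.d. $\mathcal{N}(0,1)$. With this notation, $\hat\theta_\alpha$ admits the following
representation:

$$\langle\hat\theta_\alpha,\psi_k\rangle = h_\alpha(k)y(k) = h_\alpha(k)\theta(k) + \sigma h_\alpha(k)\lambda^{-1/2}(k)\xi(k),$$

where $\theta(k) = \langle\theta,\psi_k\rangle$, and

$$\|\hat\theta_0 - \hat\theta_\alpha\|^2 = \sum_{k=1}^n [1 - h_\alpha(k)]^2 y^2(k),$$

$$\|\theta - \hat\theta_\alpha\|^2 = \sum_{k=1}^n [\theta(k) - h_\alpha(k)y(k)]^2.$$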
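
In the spectral representation, the empirical risk minimization (1.5) reduces to a one-dimensional search. The following sketch performs it on a finite grid of $\alpha$ values. Everything numerical here is an assumption made for illustration: the spectrum, the vector $\theta$, the Tikhonov-Phillips filter, the grid, and the choice of the classical unbiased-risk penalty $2\sum_k \lambda^{-1}(k)h_\alpha(k)$, which the paper itself regards as inadequate for ill-posed problems.

```python
import random

def empirical_risk(h, y, sigma, pen):
    """R_alpha[Y, Pen] = sum_k [1 - h(k)]^2 y(k)^2 + sigma^2 * Pen."""
    return sum((1.0 - hk) ** 2 * yk ** 2 for hk, yk in zip(h, y)) + sigma ** 2 * pen

def select_alpha(alphas, y, eigenvalues, sigma, smoother, penalty):
    """Grid version of (1.5): return the alpha minimizing the penalized empirical risk."""
    def risk(alpha):
        h = [smoother(alpha, lam) for lam in eigenvalues]
        return empirical_risk(h, y, sigma, penalty(h, eigenvalues))
    return min(alphas, key=risk)

random.seed(0)
n, sigma = 50, 0.01
eigenvalues = [k ** -2.0 for k in range(1, n + 1)]            # assumed spectrum
theta = [k ** -2.0 for k in range(1, n + 1)]                  # assumed signal
y = [t + sigma * lam ** -0.5 * random.gauss(0.0, 1.0)          # y(k) = theta(k) + noise
     for t, lam in zip(theta, eigenvalues)]

def tikhonov(alpha, lam):
    return lam / (lam + alpha)

def unbiased_risk_penalty(h, lams):
    return 2.0 * sum(hk / lam for hk, lam in zip(h, lams))   # penalty with no extra term

alphas = [10.0 ** (-j / 4.0) for j in range(1, 33)]           # hypothetical grid
alpha_hat = select_alpha(alphas, y, eigenvalues, sigma, tikhonov, unbiased_risk_penalty)
print(alpha_hat)
```

Any other penalty can be plugged in through the `penalty` argument without changing the search.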
In what follows, it is assumed that the penalty has the following structure:

$$\mathrm{Pen}(\alpha) = 2\sum_{k=1}^n \lambda^{-1}(k)h_\alpha(k) + (1+\gamma)Q(\alpha),$$

where $\gamma$ is a positive number and $Q(\alpha)$, $\alpha > 0$, is a positive function to be
defined later on. Then the excess risk is computed as follows:

$$L_\alpha(\theta) - R_\alpha[Y,\mathrm{Pen}] - C = \sigma^2\sum_{k=1}^n \lambda^{-1}(k)[2h_\alpha(k) - h_\alpha^2(k)](\xi^2(k) - 1) - (1+\gamma)\sigma^2Q(\alpha) - 2\sigma\sum_{k=1}^n \lambda^{-1/2}(k)[1 - h_\alpha(k)]^2\xi(k)\theta(k). \tag{1.9}$$
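
Identity (1.9) is purely algebraic, so it can be verified on one simulated data set. The sketch below computes the left-hand side from the definitions of $L_\alpha(\theta)$, $R_\alpha[Y,\mathrm{Pen}]$ and $C$, and the right-hand side from (1.9). All numerical inputs (spectrum, filter, $\gamma$, and the number standing in for $Q(\alpha)$) are arbitrary assumptions.

```python
import random

random.seed(1)
n, sigma, gamma, q_value = 8, 0.3, 0.5, 1.7   # q_value stands in for Q(alpha)
lam = [k ** -1.5 for k in range(1, n + 1)]    # assumed spectrum
h = [1.0 / (1.0 + 0.2 * k) for k in range(1, n + 1)]   # some filter values in [0, 1]
theta = [random.uniform(-1, 1) for _ in range(n)]
xi = [random.gauss(0, 1) for _ in range(n)]
y = [t + sigma * l ** -0.5 * x for t, l, x in zip(theta, lam, xi)]

# Left-hand side: L_alpha(theta) - R_alpha[Y, Pen] - C, from the definitions.
risk = (sum((1 - hk) ** 2 * t ** 2 for hk, t in zip(h, theta))
        + sigma ** 2 * sum(hk ** 2 / l for hk, l in zip(h, lam)))
pen = 2 * sum(hk / l for hk, l in zip(h, lam)) + (1 + gamma) * q_value
emp = sum((1 - hk) ** 2 * yk ** 2 for hk, yk in zip(h, y)) + sigma ** 2 * pen
c = -sigma ** 2 * sum(x ** 2 / l for x, l in zip(xi, lam))
lhs = risk - emp - c

# Right-hand side of (1.9).
rhs = (sigma ** 2 * sum((2 * hk - hk ** 2) * (x ** 2 - 1) / l
                        for hk, x, l in zip(h, xi, lam))
       - (1 + gamma) * sigma ** 2 * q_value
       - 2 * sigma * sum((1 - hk) ** 2 * x * t / l ** 0.5
                         for hk, x, t, l in zip(h, xi, theta, lam)))
print(lhs, rhs)
```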

Our first idea in solving (1.7) is to use the fact that the absolute value of
the cross term

$$2\sigma\sum_{k=1}^n \lambda^{-1/2}(k)[1 - h_\alpha(k)]^2\xi(k)\theta(k)$$

is typically smaller than $L_\alpha(\theta)$ (for more details, see Lemma 9 below).
Therefore, omitting this term in (1.7), we get the following inequality for $Q(\alpha)$:

$$E\sup_{\alpha\le\bar\alpha}[\eta_\alpha - (1+\gamma)Q(\alpha)]_+ \le KE[\eta_{\bar\alpha} - (1+\gamma)Q(\bar\alpha)]_+, \tag{1.10}$$

where

$$\eta_\alpha = \sum_{k=1}^n \lambda^{-1}(k)[2h_\alpha(k) - h_\alpha^2(k)](\xi^2(k) - 1).$$

Usually, computing the minimal function $Q(\alpha)$ assuring (1.10) is a hard
numerical problem. However, when $h_\alpha(k)$ is a family of ordered smoothers
it can be solved relatively easily. The main idea is to find a feasible solution
$Q^\circ(\alpha)$ of the marginal inequality

$$E[\eta_\alpha - Q^\circ(\alpha)]_+ \le E[\eta_{\bar\alpha} - Q^\circ(\bar\alpha)]_+ \tag{1.11}$$

and then to show that $(1+\gamma)Q^\circ(\alpha)$ satisfies (1.10). To solve (1.11), we use
the following inequality:

$$E[\eta - x]_+^p \le \Gamma(p+1)\lambda^{-p}\exp(-\lambda x)E\exp(\lambda\eta), \tag{1.12}$$

which holds for any random variable $\eta$ and for any $\lambda > 0$. Its proof follows
from the Chernoff bound. Without loss of generality, we may assume that
$Q^\circ(\bar\alpha) = 0$. Therefore, according to the Cauchy-Schwarz inequality,

$$E[\eta_{\bar\alpha} - Q^\circ(\bar\alpha)]_+ \le \sqrt{E\eta_{\bar\alpha}^2} = D(\bar\alpha),$$
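where

$$D(\alpha) = \Big\{2\sum_{k=1}^n \lambda^{-2}(k)[2h_\alpha(k) - h_\alpha^2(k)]^2\Big\}^{1/2}.$$

Hence, $Q^\circ(\alpha)$ is computed as a root of the equation

$$\inf_{\lambda}\exp[-\lambda Q^\circ(\alpha)]E\exp(\lambda\eta_\alpha) = D(\bar\alpha).$$

It is not difficult to check with a little algebra that

$$Q^\circ(\alpha) = 2D(\alpha)\mu_\alpha\sum_{k=1}^n \frac{\rho_\alpha^2(k)}{1 - 2\mu_\alpha\rho_\alpha(k)}, \tag{1.13}$$

where $\mu_\alpha$ is a root of the equation

$$\sum_{k=1}^n F[\mu_\alpha\rho_\alpha(k)] = \log\frac{D(\alpha)}{D(\bar\alpha)} \tag{1.14}$$

and

$$F(x) = \frac12\log(1 - 2x) + x + \frac{2x^2}{1 - 2x}, \qquad \rho_\alpha(k) = \sqrt2\,D^{-1}(\alpha)\lambda^{-1}(k)[2h_\alpha(k) - h_\alpha^2(k)]. \tag{1.15}$$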
The only numerical difficulty in computing $Q^\circ(\alpha)$ is related to (1.14).
However, note that in the proof of Lemma 7 it is shown that

$$f(x) = \sum_{k=1}^n F[x\rho_\alpha(k)]$$

is a strictly monotone function and therefore (1.14) may be solved
exponentially fast. Note also that Lemma 7 provides lower and upper bounds for
$Q^\circ(\alpha)$ and $\mu_\alpha$. In particular, for some constant $C > 0$,

$$C^{-1}D(\alpha)\log^{1/2}\frac{D(\alpha)}{D(\bar\alpha)} \le Q^\circ(\alpha) \le CD(\alpha)\log\frac{D(\alpha)}{D(\bar\alpha)}. \tag{1.16}$$
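
Since $f$ is monotone, $\mu_\alpha$ in (1.14) can be found by bisection, after which (1.13) gives the penalty. The sketch below follows formulas (1.13)-(1.15) as they are written in this section; the function names, the bracketing interval and the iteration count are choices made for this illustration, and the Tikhonov-Phillips filter with the spectrum $\lambda(k) = 1/k$ is an assumed test case.

```python
import math

def F(x):
    """Function F from (1.15)."""
    return 0.5 * math.log(1.0 - 2.0 * x) + x + 2.0 * x * x / (1.0 - 2.0 * x)

def D(h, lam):
    """D(alpha) = {2 sum_k lam(k)^-2 [2h(k) - h(k)^2]^2}^(1/2)."""
    return math.sqrt(2.0 * sum(((2 * hk - hk * hk) / l) ** 2 for hk, l in zip(h, lam)))

def universal_penalty(h, h_bar, lam, iters=100):
    """Q(alpha) from (1.13)-(1.15); mu_alpha is found by bisection on (1.14).
    h holds h_alpha(k), h_bar holds the filter at the largest parameter alpha_bar."""
    d, d_bar = D(h, lam), D(h_bar, lam)
    target = math.log(d / d_bar)
    if target <= 0.0:
        return 0.0                       # at alpha_bar the penalty is set to zero
    rho = [math.sqrt(2.0) * (2 * hk - hk * hk) / (l * d) for hk, l in zip(h, lam)]
    lo, hi = 0.0, 0.5 / max(rho)         # F diverges as mu * max(rho) approaches 1/2
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(F(mid * r) for r in rho) < target:
            lo = mid
        else:
            hi = mid
    mu = lo
    return 2.0 * d * mu * sum(r * r / (1.0 - 2.0 * mu * r) for r in rho)

lam = [1.0 / k for k in range(1, 51)]                     # assumed spectrum
h = [l / (l + 0.01) for l in lam]                          # Tikhonov filter, alpha = 0.01
h_bar = [l / (l + 1.0) for l in lam]                       # Tikhonov filter, alpha_bar = 1
q = universal_penalty(h, h_bar, lam)
print(q, D(h, lam), D(h_bar, lam))
```

Bisection halves the bracket at every step, which is one way to realize the "exponentially fast" solution mentioned above.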

The next theorem shows that $Q^\circ(\alpha)$ computed as a root of the marginal
inequality (1.11) satisfies the global inequality (1.10).

Theorem 1. Let $Q^\circ(\alpha)$ be defined by (1.13)-(1.15). Then for any
$\gamma > r > 0$,

Esupfo* - (1 + 7)Q°(a)]l +r < _ 3 J , 
where here and throughout the paper $C$ denotes a generic constant.

The following theorem represents the main result in this paper. It controls
the performance of the empirical risk minimization by the penalized oracle
risk defined by

$$r(\theta) = \inf_{\alpha\le\bar\alpha}\bar R_\alpha(\theta),$$

where

$$\bar R_\alpha(\theta) = E\{R_\alpha[Y,\mathrm{Pen}] + C\} = L_\alpha(\theta) + (1+\gamma)\sigma^2Q^\circ(\alpha).$$
Theorem 2. Let $\mathrm{Pen}(\alpha) = 2\sum_{k=1}^n \lambda^{-1}(k)h_\alpha(k) + (1+\gamma)Q^\circ(\alpha)$ with
$Q^\circ(\alpha)$ defined by (1.13)-(1.15). Then the mean square risk of $\hat\theta_{\hat\alpha}$ with
the data-driven smoothing parameter $\hat\alpha$ defined by (1.5) satisfies the
following upper bound:

1/2 < 



1 + 



(1.17) E\\9-9 & \\ 2 <r(8) 
which holds true uniformly in 



— log 

7 



-1/2 



Cr{8) Ca 2 D(a) 



a 2 D(a 



+ 



7 4 r(#) 



Below, we discuss briefly some statistical aspects of this theorem.

1. Equation (1.17) represents a particular form of the so-called oracle
inequality

$$E\|\theta - \hat\theta_{\hat\alpha}\|^2 \le r(\theta)\Big\{1 + \Psi\Big[\frac{\sigma^2D(\bar\alpha)}{r(\theta)}\Big]\Big\},$$

where $\Psi(\cdot)$ is a bounded function such that $\lim_{x\to0}\Psi(x) = 0$. This means
that if the ratio $\sigma^2D(\bar\alpha)/r(\theta)$ is small, then the risk of the method is
close to the risk of the penalized oracle. On the other hand, if this ratio
is large, then the risk of the method is of the order of the oracle risk.

Note also that (1.17) is a universal oracle inequality which holds true
whatever the ill-posedness of the underlying inverse problem is. It
generalizes the corresponding oracle inequalities in Cavalier and Golubev
(2006) and Golubev (2004) obtained for the spectral cut-off method.

2. Theorem 2 reveals some difficulties related to the data-driven choice of
the regularization parameter in the Tikhonov-Phillips method. Recalling
that for this method $H_\alpha(\lambda) = \lambda/(\alpha + \lambda)$, we obtain

$$D^2(\alpha) = 2\sum_{k=1}^n \lambda^{-2}(k)[2h_\alpha(k) - h_\alpha^2(k)]^2 \ge 2\sum_{k=1}^n \lambda^{-2}(k)h_\alpha^2(k) = 2\sum_{k=1}^n \frac{1}{[\alpha + \lambda(k)]^2} \approx \frac{2n}{[\alpha + \lambda(n)]^2}.$$

Since $Q^\circ(\alpha) \ge D(\alpha) \asymp \sqrt{n}$, it is clear that the penalized oracle risk of the
Tikhonov-Phillips regularization may be very large compared to the risk
of the method computed for given $\alpha$. This means that in practice, the
Tikhonov-Phillips regularization with a data-driven smoothing parameter
may fail.

Note however, that this drawback can be easily overcome with the help
of high-order Tikhonov-Phillips regularizations computed as follows:

$$\hat\theta_\alpha^{k} = \arg\min_{\theta}\{\|A\hat\theta_\alpha^{k-1} - A\theta\|^2 + \alpha\|\theta\|^2\},$$

where $\hat\theta_\alpha^{1}$ stands for the standard Tikhonov-Phillips regularization. One
can check with a little algebra that the corresponding smoothers are
given by $H_\alpha^k(\lambda) = \lambda^k(\alpha + \lambda)^{-k}$, $k \ge 2$, and everything goes smoothly in
this case.

3. If the inverse problem is not severely ill-posed; that is, $\lambda(k) \ge Ck^{-\beta}$ for
some $\beta > 0$, then, according to (1.16), for reasonable spectral
regularizations

$$\sum_{k=1}^n h_\alpha^2(k)\lambda^{-1}(k) \gg Q^\circ(\alpha), \tag{1.18}$$

when $\alpha$ is small. This means that the risk of the penalized oracle is close
to the risk of the ideal oracle $\inf_{\alpha\le\bar\alpha}L_\alpha(\theta)$.

This remark together with the famous Pinsker (1980) minimax theorem
shows that our method results in adaptive asymptotically (as $\sigma \to 0$)
minimax regularizations. To demonstrate this, suppose for simplicity that
$n = \infty$ and that $\theta$ belongs to the following ellipsoidal body

$$\Theta(W) = \Big\{\theta \in l_2(1,\infty): \sum_{k=1}^\infty \langle\theta,\psi_k\rangle^2 b^2[\lambda^{-1}(k)] \le W\Big\},$$

where $|b(x)|$ is a nondecreasing function such that for some $p, q \in (0,\infty)$,
$Cx^p \le b(x) \le Cx^q$. Then it follows from Pinsker (1980) that as $\sigma \to 0$

$$\inf_{\tilde\theta}\sup_{\theta\in\Theta(W)}E\|\theta - \tilde\theta\|^2 = (1 + o(1))\inf_{h}\sup_{\theta\in\Theta(W)}L[\theta,h] = (1 + o(1))\sup_{\theta\in\Theta(W)}\inf_{h}L[\theta,h] = (1 + o(1))\sup_{\theta\in\Theta(W)}\inf_{\alpha}L[\theta,h^*_\alpha],$$

where inf at the left-hand side is taken over all estimators,

$$L[\theta,h] = \sum_{k=1}^n [1 - h(k)]^2\langle\theta,\psi_k\rangle^2 + \sigma^2\sum_{k=1}^n \lambda^{-1}(k)h^2(k)$$

and $h^*_\alpha(k) = [1 - \alpha|b[\lambda^{-1}(k)]|]_+$. Recall that from a statistical viewpoint,
the main drawback in this minimax result is that the optimal smoothing
parameter

$$\alpha^*(W) = \arg\min_{\alpha\in(0,\bar\alpha]}\sup_{\theta\in\Theta(W)}L[\theta,h^*_\alpha]$$

depends on the size $W$ of $\Theta(W)$ which is hardly known in practice. In
order to overcome this difficulty, one may use the data-driven
regularization $\hat\alpha$ with $h_\alpha(k) = h^*_\alpha(k)$. Noticing that this family of smoothers is
ordered, we get according to Theorem 2 and (1.18)

$$\sup_{\theta\in\Theta(W)}E\|\theta - \hat\theta_{\hat\alpha}\|^2 = (1 + o(1))\inf_{\tilde\theta}\sup_{\theta\in\Theta(W)}E\|\theta - \tilde\theta\|^2 \quad\text{as } \sigma \to 0.$$

Another interesting situation is related to the case when the inverse
problem is severely ill-posed; that is, when the eigenvalues of $AA^T$ are
exponentially decreasing, $\lambda(k) \sim \exp(-\beta k)$ with some $\beta > 0$. Then for
small $\alpha$

$$\sum_{k=1}^n h_\alpha^2(k)\lambda^{-1}(k) \asymp Q^\circ(\alpha).$$

This means that the risk of the penalized oracle is essentially greater than 
that of the ideal oracle. In this situation, Theorem 2 provides an upper 
bound similar to Golubev (2004). It is worth pointing out that neither 
(1.17) nor the extra penalty can be improved in this case [for more details, 
see Golubev (2004)]. 
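
The remark on the Tikhonov-Phillips method above can be illustrated numerically. The sketch below evaluates $D^2(\alpha)$ for the filter $h_\alpha(k) = \lambda(k)/(\alpha + \lambda(k))$, compares it with the lower bound $2\sum_k[\alpha + \lambda(k)]^{-2}$, and looks at how $D(\alpha)$ grows when $n$ is multiplied by four. The exponential spectrum and the value of $\alpha$ are assumptions chosen for illustration.

```python
import math

def tikhonov_D_squared(lam, alpha):
    """D^2(alpha) for the Tikhonov-Phillips filter h = lam / (lam + alpha)."""
    total = 0.0
    for l in lam:
        h = l / (l + alpha)
        total += ((2.0 * h - h * h) / l) ** 2
    return 2.0 * total

def lower_bound(lam, alpha):
    """The bound 2 * sum_k 1 / (alpha + lam(k))^2."""
    return 2.0 * sum(1.0 / (alpha + l) ** 2 for l in lam)

def spectrum(n):
    return [math.exp(-k) for k in range(1, n + 1)]   # assumed exponential decay

alpha = 0.1                                          # illustrative value
d100 = math.sqrt(tikhonov_D_squared(spectrum(100), alpha))
d400 = math.sqrt(tikhonov_D_squared(spectrum(400), alpha))
print(d100, d400, d400 / d100)
```

Quadrupling $n$ roughly doubles $D(\alpha)$ in this example, in line with growth of order $\sqrt{n}$.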



2. Proofs. 



2.1. An exponential chaining inequality. Let $\xi_t$ be a separable zero mean
random process on $\mathbb{R}^+$. Denote for brevity

$$\Delta^\xi(t_1,t_2) = \xi_{t_1} - \xi_{t_2}.$$

We begin with a general fact similar to Dudley's entropy bound [see, e.g.,
Van der Vaart and Wellner (1996)].

Lemma 1. Let $\sigma_u^2$, $u \in \mathbb{R}^+$, be a continuous strictly increasing function
with $\sigma_0^2 = 0$. Then for any $\lambda > 0$,

$$\log E\exp\Big\{\lambda\max_{0<u\le t}\frac{\Delta^\xi(u,t)}{\sigma_t}\Big\} \le \frac{\log(2)\sqrt2}{\sqrt2 - 1} + \max_{0<u<v\le t}\;\max_{|z|\le\sqrt2/(\sqrt2-1)}\log E\exp\Big\{z\lambda\frac{\Delta^\xi(u,v)}{\Delta^\sigma(v,u)}\Big\}, \tag{2.1}$$

where $\Delta^\sigma(v,u) = \sqrt{|\sigma_v^2 - \sigma_u^2|}$.
Proof. The proof of (2.1) is based on the standard chaining argument
[for more details, see Van der Vaart and Wellner (1996)]. Denote for brevity
by $t_-(B)$ and $t_+(B)$ the left and right end points of a closed subset $B$ in $\mathbb{R}^+$.
First, we construct a dyadic partition of $[0,t]$. Let

$$T_1^1 = \{u \ge 0: \sigma_u^2 \le \sigma_t^2/2\}, \qquad T_1^2 = \{u \le t: \sigma_u^2 \ge \sigma_t^2/2\}.$$

Next, we partition $T_1^1$ and $T_1^2$ as follows:



7? = <« 



c ~-l. 2 < a t+ (7? ) 1 



2 

7 2 = < u G 7x : cr n > — - — 



/a = <i w G / 2 : cr n < z 



r2 j rT i, T 2, A^[t + (r 2 1 ),t_(r 2 1 )] 



Doing so, after $p$ steps, we get partitions $T_p^j$, $j = 1, \ldots, 2^p$, such that
for any $x, y \in T_k^j$

$$[\Delta^\sigma(y,x)]^2 \le 2^{-k}\sigma_t^2. \tag{2.2}$$

With the sets $T_k^j$, $j = 1, \ldots, 2^k$, we associate the set of their right points

$$\tau_k = \bigcup_{j=1}^{2^k}\{t_+(T_k^j)\},$$

and for any point $x \in \tau_k$ we denote by $\tau_{k-1}(x)$ the nearest point in $\tau_{k-1}$.
So, by (2.2), for any $v \in \tau_p$

$$[\Delta^\sigma(\tau_{p-1}(v),v)]^2 \le 2^{-(p-1)}\sigma_t^2. \tag{2.3}$$

With this notation, for any $v \in \tau_p$, setting $\tau_0(v) = t$, we obtain



&• - 6 = - ^(u)] < ^ sup & " 

fc=l fc=l«er* 

(2 - 4) 

= V sup ^ u,Tfc_i(t;) x — . 
To bound the right-hand side of the above display, we use the elementary
inequality

$$\log E\exp\Big[\sum_k q(k)\eta(k)\Big] \le \sum_k q(k)\log E\exp[\eta(k)], \tag{2.5}$$

which holds for any random variables $\eta(k)$ and any given $q(k) \ge 0$ with
$\sum_k q(k) = 1$. The proof of (2.5) follows immediately from the convexity of
$\exp(x)$ which implies

$$E\exp\Big\{\sum_k q(k)\{\eta(k) - \log E\exp[\eta(k)]\}\Big\} \le \sum_k q(k)E\exp\{\eta(k) - \log E\exp[\eta(k)]\} = 1.$$
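
Inequality (2.5) can be checked exactly on a small finite probability space, where all expectations are finite sums. The outcomes, the values of the two random variables and the weights below are arbitrary assumptions used only to illustrate the statement.

```python
import math

# Finite probability space with three equally likely outcomes (illustrative).
probs = [1 / 3, 1 / 3, 1 / 3]
eta = [[0.5, -1.0, 2.0],     # values of eta(1) on each outcome
       [1.5, 0.3, -0.7]]     # values of eta(2) on each outcome
q = [0.25, 0.75]             # nonnegative weights summing to one

def log_mgf(values):
    """log E exp(values) under the distribution probs."""
    return math.log(sum(p * math.exp(v) for p, v in zip(probs, values)))

# Weighted combination sum_k q(k) eta(k), evaluated outcome by outcome.
mixture = [sum(qk * eta_k[w] for qk, eta_k in zip(q, eta)) for w in range(3)]
lhs = log_mgf(mixture)
rhs = sum(qk * log_mgf(eta_k) for qk, eta_k in zip(q, eta))
print(lhs, rhs)
```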



2 -*/2 



Applying (2.5) with 
q(k) 

r](k) 



A 



-— sup A CT [v,Tk-l{V)\ x -=— — , 

a t q{k) veT k A a [v,T k -i(v)] 



we obtain by (2.3) and (2.4) 



(2.6) log E exp exp 



iv(k)}. 



It is easily seen that for any A > 
Eexp[r/(fc)] 

A 



Eexp 



(2.7) 



— — — sup A ff rn D ,!) x -=— — — - 

q{k)a tv&T k Ao-Tfc-i 



< 



Eexp 



ii£t 



— — A a Tfc_i(u),u x — - 

q(k)a t A a [Tk-i(v),v\ 



< 2 sup sup E exp 



u < v \z\<V2/(V2-i) I A a (v,u) 

In the above equation, it was used that $\sum_{s=1}^\infty 2^{-s/2} = 1/(\sqrt2 - 1)$.
Finally, substituting (2.7) into (2.6), we arrive at



, „ r A £ (u,t)i 

logEexp< A sup — 12 > 



< 
thus completing the proof. □

2.2. Ordered processes.

Definition 2. A zero mean process $\xi_t$, $t \in \mathbb{R}^+$, is called ordered if there
exists a continuous strictly monotone scaling function $\sigma_t^2$, $t \in \mathbb{R}^+$, and some
$\lambda > 0$ such that

$$\sup_{u,v}E\exp\Big[\lambda\frac{\Delta^\xi(u,v)}{\Delta^\sigma(v,u)}\Big] < \infty. \tag{2.8}$$

A trivial example of an ordered process is a standard Wiener process $w_t$.
In this case, $\sigma_t^2 = t$ and obviously

$$E\exp\Big[\lambda\frac{\Delta^w(u,v)}{\sqrt{|v - u|}}\Big] = \exp(\lambda^2/2).$$



Lemma 2. Let $\xi_t$ be an ordered process with $\xi_0 = 0$. Then there exists a
constant $C$ such that for all $1 < p, q < 2$, uniformly in $z > 0$,

$$E\sup_{t\ge0}[\xi_t - z\sigma_t^q]_+^p \le \frac{C}{z^{p/(q-1)}}, \tag{2.9}$$

where $[x]_+ = \max(0,x)$.

Proof. Without loss of generality, we may assume that $\lim_{t\to\infty}\sigma_t = \infty$.
For any integer $k \ge 0$, define $t_k(z)$ as a root of the equation

q l _ 2 1 /('?- 1 )^ 

Then we have 

oo 

E supfe - zaf] p + < T E sup [6 - za\\\ 
t>o k=0 te[t k (i),t k+1 (z)] 

oo 

(2.10) <^E sup 

fe=0 t£[t k (z),t k+1 {z)] 
oo 

<E sup |6| p + Ve[ sup it-za q [z) 
o<t<h(z) k=1 l 0<t<t k+1 (z) 
According to Lemma 1, the first term at the right-hand side of the above
inequality is bounded as follows:

e sup i&r < cyvf , . = c , pP ^ , 

o<t<t l(z ) ~ tl{z) W(ff-D' 

whereas the second one, by (1.12) and Lemma 1, is controlled by 

p 



(2.11) 



E 



sup 6-*< w 

*<*fe+i(7) 



Z— / *fc+i(z) 



fc=l 



sup 



6 



2(7 



> 



*fc(z) 



t<ifc+i(z) a t k +i{z) a t k+1 (z) 



< 



Z / ifc+1 



exp 



k=l 



-A- 



tfc(z) 



^ / <,-, u , / <,-> I> +1 >'' / "~"^ 

C 



< 



fc=i 



(1 + 



sp/fa-i) ' 

So, combining the above inequality with (2.10) and (2.11), we arrive at (2.9). 

□ 

The next very simple lemma is useful for understanding the fact that the
ordered process is controlled by its variance $\sigma_t^2$.

Lemma 3. Let $\xi_t$, $t \in \mathbb{R}^+$, be a random process such that

$$E\sup_{t}[\xi_t - z\sigma_t^q]_+^p \le \frac{C}{z^{p/(q-1)}} \tag{2.12}$$

for any $z > 0$ and some $p \ge 1$, $q > 1$. Then there exists a constant $C'$ such
that for any random variable $\tau \in \mathbb{R}^+$

$$[E|\xi_\tau|^p]^{1/p} \le C'[E\sigma_\tau^{qp}]^{1/(pq)}.$$

Proof. According to (2.12) and Minkowski's inequality, we obviously
have

$$[E|\xi_\tau|^p]^{1/p} \le \{E|\xi_\tau - z\sigma_\tau^q + z\sigma_\tau^q|^p\}^{1/p} \le \Big\{E\max_t[\xi_t - z\sigma_t^q]_+^p\Big\}^{1/p} + z[E\sigma_\tau^{pq}]^{1/p},$$

and minimizing the right-hand side in $z$ we finish the proof. □

2.3. Ordered processes related to the spectral regularization. In this
section, we focus on typical ordered processes related to the empirical risk
minimization.

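For given $\alpha^\circ \in (0,\bar\alpha]$, define the following Gaussian processes:

$$\xi_\alpha^+ = \sum_{k=1}^n [h_{\alpha^\circ}(k) - h_{\alpha^\circ+\alpha}(k)]b(k)\xi(k), \qquad 0 \le \alpha \le \bar\alpha - \alpha^\circ,$$

$$\xi_\alpha^- = \sum_{k=1}^n [h_{\alpha^\circ}(k) - h_{\alpha^\circ-\alpha}(k)]b(k)\xi(k), \qquad 0 \le \alpha \le \alpha^\circ,$$

where $\xi(k)$ are i.i.d. $\mathcal{N}(0,1)$ and $\sum_{k=1}^n b^2(k) < \infty$. It is easily seen that
$\xi_\alpha^+$ and $\xi_\alpha^-$ are ordered processes. Indeed, since they are Gaussian, we can
choose $\sigma_\alpha^2 = E(\xi_\alpha^\pm)^2$, and it suffices to check that

$$|E(\xi_{\alpha_1}^\pm)^2 - E(\xi_{\alpha_2}^\pm)^2| \ge E(\xi_{\alpha_1}^\pm - \xi_{\alpha_2}^\pm)^2,$$

or equivalently,

$$E\xi_{\alpha_1}^\pm\xi_{\alpha_2}^\pm \ge \min\{E(\xi_{\alpha_1}^\pm)^2, E(\xi_{\alpha_2}^\pm)^2\}.$$

If $\alpha_1 \le \alpha_2$, then we have

$$E(\xi_{\alpha_1}^\pm)^2 = \sum_{k=1}^n [h_{\alpha^\circ}(k) - h_{\alpha^\circ\pm\alpha_1}(k)]^2b^2(k) \le \sum_{k=1}^n [h_{\alpha^\circ}(k) - h_{\alpha^\circ\pm\alpha_1}(k)][h_{\alpha^\circ}(k) - h_{\alpha^\circ\pm\alpha_2}(k)]b^2(k) = E\xi_{\alpha_1}^\pm\xi_{\alpha_2}^\pm.$$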
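Therefore, according to Lemma 2, we get

$$E\sup_{0\le\alpha\le\bar\alpha-\alpha^\circ}\Big[\xi_\alpha^+ - z\Big\{\sum_{k=1}^n [h_{\alpha^\circ}(k) - h_{\alpha^\circ+\alpha}(k)]^2b^2(k)\Big\}^{q/2}\Big]_+^p \le \frac{C(p,q)}{z^{p/(q-1)}},$$

$$E\sup_{0\le\alpha\le\alpha^\circ}\Big[\xi_\alpha^- - z\Big\{\sum_{k=1}^n [h_{\alpha^\circ}(k) - h_{\alpha^\circ-\alpha}(k)]^2b^2(k)\Big\}^{q/2}\Big]_+^p \le \frac{C(p,q)}{z^{p/(q-1)}},$$

and combining these inequalities we arrive at the following lemma.

Lemma 4. For any $z > 0$,

$$E\sup_{0<\alpha\le\bar\alpha}\Big[\sum_{k=1}^n [h_{\alpha^\circ}(k) - h_\alpha(k)]b(k)\xi(k) - z\Big\{\sum_{k=1}^n [h_{\alpha^\circ}(k) - h_\alpha(k)]^2b^2(k)\Big\}^{q/2}\Big]_+^p \le \frac{C(p,q)}{z^{p/(q-1)}}.$$

The next fact is essential for bounding cross terms in the empirical risk.

Lemma 5. Let $\alpha^\circ$ be a given smoothing parameter. Then for any
$p \in [1,2)$, there exists a constant $C(p)$ such that for any data-driven smoothing
parameter $\hat\alpha$,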



E 



5>&(fc) - M*O]A~ 1/2 (*O0(fc)£(fc) 

k=l 

ley \ ^*-> 

(2.13) <C(p)|EmaxA" 1 (A;)/i|(A:)} P ^[1-VW] 2 ^W 

/2 r °° 

+ C(p){maxA- 1 (/fc)/^o(£:)} P J E^[l - h & (k)] 2 9 2 {k) 



p/2 



p/2 



fe=l 



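Proof. From Lemmas 4 and 3, it follows that

$$E\Big|\sum_{k=1}^\infty [h_{\hat\alpha}(k) - h_{\alpha^\circ}(k)]\lambda^{-1/2}(k)\theta(k)\xi(k)\Big|^p \le C(p)\Big\{E\sum_{k=1}^\infty [h_{\alpha^\circ}(k) - h_{\hat\alpha}(k)]^2\lambda^{-1}(k)\theta^2(k)\Big\}^{p/2}. \tag{2.14}$$

To bound from above the right-hand side of (2.14), we use that $h_\alpha(\cdot)$ is a
family of ordered smoothers. With this in mind, let us assume for definiteness
that $h_{\alpha_1}(k) \ge h_{\alpha_2}(k)$ for $\alpha_1 \le \alpha_2$. Then, if $\hat\alpha \ge \alpha^\circ$, we get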



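$$\frac{h_{\hat\alpha}(k)}{h_{\alpha^\circ}(k)} \le 1, \qquad \frac{h_{\hat\alpha}(k)}{h_{\alpha^\circ}(k)} \ge h_{\hat\alpha}(k),$$

and thus we obtain

$$\sum_{k=1}^\infty [h_{\alpha^\circ}(k) - h_{\hat\alpha}(k)]^2\lambda^{-1}(k)\theta^2(k) = \sum_{k=1}^\infty \Big[1 - \frac{h_{\hat\alpha}(k)}{h_{\alpha^\circ}(k)}\Big]^2h_{\alpha^\circ}^2(k)\lambda^{-1}(k)\theta^2(k) \le \max_k\lambda^{-1}(k)h_{\alpha^\circ}^2(k)\sum_{k=1}^\infty [1 - h_{\hat\alpha}(k)]^2\theta^2(k). \tag{2.15}$$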
Similarly, if a < a° , then 

oo 

Y,Ko(k)-h & (k)} 2 \-\k)e 2 (k) 



k=l 



(2.16) 



< max \~ l {k)h\(k) - h a o{k)] 2 6 2 {k) 



+ maxA -1 ^)/!^ (jfe) - h & (k)] 2 2 (k). 

k k=l 

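Therefore, combining (2.15) and (2.16), we get (2.13). □

The following important ordered process is defined by

$$\zeta_\alpha = \sum_{k=1}^n h_\alpha(k)b(k)[\xi^2(k) - 1],$$

where $\xi(k)$ are i.i.d. $\mathcal{N}(0,1)$. Let

$$\sigma_\alpha = \Big[2\sum_{k=1}^n h_\alpha^2(k)b^2(k)\Big]^{1/2}.$$

It is easy to check that $\min\{E\zeta_{\alpha_1}^2, E\zeta_{\alpha_2}^2\} \le E\zeta_{\alpha_2}\zeta_{\alpha_1}$ and thus

$$|\sigma_{\alpha_1}^2 - \sigma_{\alpha_2}^2| \ge 2\|h_{\alpha_1} - h_{\alpha_2}\|_b^2,$$

where

$$\|h_{\alpha_1} - h_{\alpha_2}\|_b^2 = \sum_{k=1}^n b^2(k)[h_{\alpha_1}(k) - h_{\alpha_2}(k)]^2.$$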



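Hence, in order to apply Lemma 2, it remains to check that for some $\lambda > 0$

$$\sup_{\alpha_1,\alpha_2}E\exp\Big[\lambda\frac{\Delta^\zeta(\alpha_1,\alpha_2)}{\|h_{\alpha_1} - h_{\alpha_2}\|_b}\Big] < \infty.$$

We have

$$E\exp\Big[\lambda\frac{\Delta^\zeta(\alpha_1,\alpha_2)}{\sqrt2\|h_{\alpha_1} - h_{\alpha_2}\|_b}\Big] = \exp\Big\{-\frac{\lambda}{\sqrt2\|h_{\alpha_1} - h_{\alpha_2}\|_b}\sum_{k=1}^n b(k)[h_{\alpha_1}(k) - h_{\alpha_2}(k)] - \frac12\sum_{k=1}^n \log\Big[1 - \sqrt2\lambda\frac{b(k)[h_{\alpha_1}(k) - h_{\alpha_2}(k)]}{\|h_{\alpha_1} - h_{\alpha_2}\|_b}\Big]\Big\}. \tag{2.17}$$

Since obviously

$$\max_k\{|b(k)||h_{\alpha_1}(k) - h_{\alpha_2}(k)|\} \le \|h_{\alpha_1} - h_{\alpha_2}\|_b,$$

then using the Taylor expansion for $\log(1 - \cdot)$ at the right-hand side of (2.17),
we get for $\lambda < 1/2$

$$E\exp\Big[\lambda\frac{\Delta^\zeta(\alpha_1,\alpha_2)}{\sqrt2\|h_{\alpha_1} - h_{\alpha_2}\|_b}\Big] \le \exp(C\lambda^2),$$

thus proving (2.8). Hence, with the help of Lemma 2 we obtain the following
fact.

Lemma 6. For any $z > 0$,

$$E\sup_{\alpha\in(0,\bar\alpha]}\Big[\sum_{k=1}^n h_\alpha(k)b(k)[\xi^2(k) - 1] - z\Big\{2\sum_{k=1}^n h_\alpha^2(k)b^2(k)\Big\}^{q/2}\Big]_+^p \le \frac{C(p,q)}{z^{p/(q-1)}}.$$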



2.4. Proof of Theorem 1. The next lemma describes some basic properties
of the universal penalty defined by (1.13)-(1.15).

Lemma 7. For any $\alpha \in (0,\bar\alpha]$,
(2.18) 



(2.19) 

(2.20) 
(2.21) 
(2.22) 



D(a) Q°(a) 
logT-r < Ma" 



D(a) -™D(a) 



1 / D(a) 1 
M ^ mm \2V l0g ^M'4 

Q°(a) > D( a ) ^ lQg ^H xl/2 



£>(a) - D(a) V D{a 



D(a) HaQ°(a) /. fi a Q°{a) 

> =r— ^ / log ■ 



D(a) ~ D(a) 

D( ai ) Q°(ai) 
D(a 2 ) ~ Q°(a 2 y 



D{a) 

Oil < 02- 



i/L»(a) >exp(2)D(a), 
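Proof. It follows from (1.13)-(1.15) that

$$\mu_\alpha\frac{Q^\circ(\alpha)}{D(\alpha)} + \sum_{k=1}^n \Big\{\frac12\log[1 - 2\mu_\alpha\rho_\alpha(k)] + \mu_\alpha\rho_\alpha(k)\Big\} = \log\frac{D(\alpha)}{D(\bar\alpha)},$$

and together with the inequality $\log(1 - 2x)/2 + x \le 0$, we get (2.18).
To verify (2.19), note that

$$F(x) \le \frac{2x^2}{1 - 2x}$$

and therefore, for any $\mu \in [0, 1/4]$,

$$\sum_{k=1}^n F[\mu\rho_\alpha(k)] \le 4\mu^2\sum_{k=1}^n \rho_\alpha^2(k) = 4\mu^2.$$

So, if $\mu_\alpha \le 1/4$, then

$$\mu_\alpha^2 \ge \frac14\sum_{k=1}^n F[\mu_\alpha\rho_\alpha(k)] = \frac14\log\frac{D(\alpha)}{D(\bar\alpha)},$$

thus proving (2.19).

Next, note that the following inequality holds:

$$F(x) \ge \frac{x^2}{1 - 2x}, \tag{2.23}$$

since

$$f(x) = F(x) - \frac{x^2}{1 - 2x} = \frac12\log(1 - 2x) + x + \frac{x^2}{1 - 2x}$$

is a nonnegative function for $x \ge 0$ because

$$f'(x) = \frac{2x^2}{(1 - 2x)^2} \ge 0$$

and $f(0) = 0$.

According to (2.23), $F(x) \ge x^2$ and therefore, by (1.14),

$$\mu_\alpha^2 \le \log\frac{D(\alpha)}{D(\bar\alpha)}.$$

Substituting this inequality into (2.18), we get (2.20).

We now turn to the proof of (2.21). Again, combining (2.23) with
(1.13)-(1.15), we arrive at

$$Q^\circ(\alpha) = 2D(\alpha)\mu_\alpha\sum_{k=1}^n \frac{\rho_\alpha^2(k)}{1 - 2\mu_\alpha\rho_\alpha(k)} \le \frac{2D(\alpha)}{\mu_\alpha}\sum_{k=1}^n F[\mu_\alpha\rho_\alpha(k)] = \frac{2D(\alpha)}{\mu_\alpha}\log\frac{D(\alpha)}{D(\bar\alpha)}, \tag{2.24}$$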
and to get (2.21) it remains to invert this equation. We proceed to show
that if $x \ge \exp(2)$, then the inequality $x\log(x) \ge y$ implies

$$x \ge \frac{y}{\log(y)}. \tag{2.25}$$

It is clear that $G(y) = y/\log(y)$ is an increasing function when $y \ge \exp(1)$
and (2.25) holds since

$$x \ge \frac{x\log(x)}{\log(x\log(x))} = \frac{x}{1 + \log\log(x)/\log(x)}.$$

Inverting (2.24) with the help of (2.25), we finish the proof of (2.21).
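
The inversion step (2.25) can be checked numerically in its extreme case $y = x\log(x)$, which is the largest $y$ compatible with $x\log(x) \ge y$. The test points below are arbitrary values with $x \ge \exp(2)$, chosen for illustration.

```python
import math

def implied_lower_bound(y):
    """The quantity y / log(y) appearing on the right-hand side of (2.25)."""
    return y / math.log(y)

checks = []
for x in (math.exp(2), 10.0, 1e3, 1e6):
    y = x * math.log(x)                 # largest y allowed by x log(x) >= y
    checks.append(x >= implied_lower_bound(y))
print(checks)
```

Because $G(y) = y/\log(y)$ is increasing for $y \ge \exp(1)$, smaller admissible values of $y$ in that range only make the bound easier to satisfy.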
Finally, (2.22) follows from the fact that 



D(a) 1 - 2p a p a (k) 

is a decreasing function in a > 0. To check this, let us note that g{a) is a 
root of the equation 

D(a) 



expi inf [—G a (fi) - pg(a)] \ 

lu>0 J 



^ t >0 L " vrv ^v-vJj D ( a y 
where 



k=l 



VPa(k) + ~ log[l - 2fip a (k)} 



However inf At >o[— G a (fi) — fix] is obviously a decreasing function in x and 
therefore if D(a) / D(a) is decreasing in a, then g(a) is decreasing in a too. 
□ 

We are now in a position to prove Theorem 1. Let $\alpha_k$, $k = 0, \ldots$, be the
decreasing sequence defined as follows:

$$\alpha_0 = \bar\alpha, \qquad Q^\circ(\alpha_k) = (1 + \delta)^{k-1}Q^\circ(\alpha_1),$$

where $\delta \le 1/2$ is a small positive number which will be chosen later on, and
$\alpha_1$ is a root of the equation

$$D(\alpha_1) = D(\bar\alpha)\exp(2).$$

Denote for brevity $D_k = D(\alpha_k)$ and $Q_k = Q^\circ(\alpha_k)$.
We begin with the simple inequality

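$$E\sup_{\alpha\le\bar\alpha}[\eta_\alpha - (1+\gamma)Q^\circ(\alpha)]_+^{1+r} \le \sum_{k=1}^n E\sup_{\alpha_k\le\alpha\le\alpha_{k-1}}[\zeta_\alpha - (1+\gamma)Q_{k-1}]_+^{1+r} = \sum_{k=1}^n E\Big[\zeta_{\alpha_k} - \varepsilon\gamma Q_{k-1} + \sup_{\alpha_k\le\alpha\le\alpha_{k-1}}[\zeta_\alpha - \zeta_{\alpha_k}] - (1 + \gamma - \varepsilon\gamma)Q_{k-1}\Big]_+^{1+r}.$$

Using that $[x + y]_+^{1+r} \le 2^r[x]_+^{1+r} + 2^r[y]_+^{1+r}$, we can continue the above
equation as follows: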



°r„ Ml+r 



Esupfo*- (1 + 7 )Q»] 



(2.26) 



<2 r ^E[C Qfc -(l + 7-£7)Q fc -i] 



l+r 



fc=l 



+ 2 r ^E[ sup [Ca-Ca fc ]-£7Qfc- 



fc=l 

We control the first term 
def 



l+r 



Ax ( 7 , e) = ^ E[C Qfc - (1 + 7 - £7)Q*-i] 



l+r 



fc=l 



at the right-hand side of (2.26) with the help of (1.12). Thus, we obtain for 
any \ k > 

l+r 



A 1 ( 7 ,e)<E|C ai | 1+r + ^^ +? -E 



fc=2 



^- (1+7 - £7) ^- 



< CD 1+r (a) 



(2.27) + r(l + r) £ Afc 1 -^ exp{ -A fc |* 

k=2 ^ k 



7(1 -e)Q fc -i 

Qk 

Qk-1 



Qk 



<[X k ^-Xk Q 



According to (1.13) and (1.14), we have with X k — 



L> fc Eexp( A fc — i - A fc — ) =L> , 



and substituting this into (2.27), we get 

A! (7, e) < CD 1 ^ \ 1 + Afe 1_r exp 



(2.28) 



k=2 



~ { 1 {l-e)-5)Q k 
k (l + S)D k 
+ r log



Dk 
D n 



Since by (2.18) 
(2.29) 



A fe — >log— , 



and according to (2.19), is bounded from below by a constant, we obtain 
from (2.28) 



(2.30) Ai (7, e) < CD 1 ^ I 1 + exp 

Next, according to (2.21) and (2.19), we get 
(2.31) 



7(1 -f)-j V Dk 



log— >-^-log — — 



and with this inequality we obtain from (2.30) 

' 7 (l- e )-«S 

exp<j -y 

k=2 



A l(7 , £ ) < CD^^- ^i + s d ~ r ) log 



log 



< £ exp j - ( 7(1 1+ ^ 6 - r) [k log(l + 5) - log 



(kS)]\. 



k=2 

Finally, one can check by the Laplace method that 



thus yielding 
(2.32) 



C ( z \ 
exp[-zik + z 2 log(fc)] < — ( — I 



zi,z 2 > 0, 



Ai(7,e)< 



5[ 7 (l- £ )-5-(l + %] + " 
Our next step is to bound from above the last term in (2.26), namely, 

n j_. 

A 2 (7, e) = f ^2 E sup [Cq - Cq J - eiQk-i 

k= l l O! k <a<a k _i 

Consider the following random processes: 

Ca(k) = (a ~ Ca k , t € [0, - a k ] . 



Denote for brevity a a (k) = J~E(%(k). Noticing that h a (k) = 2h a (k) — h 2 a (k) 
is a family of ordered smoothers, it is easy to check that 

a 2 u (k) - 5*(k) > E[C„(fc) - Cv(k)?, u < v. 
According to the Taylor formula, for all $u > v$ and all $\lambda > 0$,



EexplA 



( y ) 



and applying Lemma 1 and (1.12), we obtain for any A^ > 

l+r 



A 2 (7, e) = ^ E [ sup Ca (fc) - £7Q 



fe-1 



<c^+cx;5 j i- ai (fc)A^ 

A fc £7Q fc _i 



fc=2 



x exp 



+ 



Substituting 



Afc 



&a fc _x- afc (*;) (V2-1) 2 

(x/2-l) 2 £7Qfc-l 



2a, 



afc_i— a* 



into the above inequality and noticing that $Q_k \ge D_k$, we obtain



C 



0", 



A 2 (7, E )<CDr+ (e7)1+r ^^ 



2(l+r) 



1-r 



x exp 



(2.33) 



(V2-l) 2 £ 2 7 2 QLl" 



4a 2 

w a fe _i-a fc 



(k) 



c 



fc=2 



x exp 



According to (2.22), 



D fc <_gL<(i + 5) 2 



dU ~ Ql-i 

and with this inequality we continue (2.33) as follows: 



A 2 (7, £ )<^ + ^p,£ 



fc=2 



fc-1
Next, substituting (2.29) and (2.31) into this equation, we get



A 2 ( 7 , e) < CDl+r + j-^ Ql-i ex P 



k=2 



(^-1)V 7 2 2 D fc _i 
— log 



85tf 



ak-i 



Dn 



r<xl+r n 



exp 



k=2 



Ce 2 7 2 ^ 2 Q k -i 



■log' 



Dn 



< 



C5 1+r D l 



r r>l+ r 



(67) 



l+r 



^exp 



fc=2 



(7 F 2 2 

5(r + l)k -i- [Jfe log(l + <5) - log(5£;)] 2 



Bounding the last sum in this display with the help of the Laplace method, 
we get 

^expj <f(r + -^-[Hog(l + 6) - log(6k)f 

l 1 *> 



< exp 



C5 3 



_e 2 7 2 log 2 (l + <5) 
and therefore with $\delta = \varepsilon^2\gamma^2$ we obtain



CV5 



e 7 log(l + S) ' 



(2.34) 



rr> 1+r 

A 2 ( 7 ,6)< 



(S7) 



l-r " 



With this $\delta$, inequality (2.32) becomes

$$\Delta_1(\gamma,\varepsilon) \le \frac{CD^{1+r}(\alpha)}{(\gamma\varepsilon)^2[\gamma-r-\gamma\varepsilon-(1+r)(\gamma\varepsilon)^2]_+}. \tag{2.35}$$
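Let $\varepsilon$ be a positive root of the equation

$$\gamma - r - \gamma\varepsilon - (1+r)(\gamma\varepsilon)^2 = \gamma\varepsilon,$$

that is,

$$\varepsilon = -\frac{1}{(r+1)\gamma} + \frac{1}{\gamma}\sqrt{\frac{1}{(1+r)^2} + \frac{\gamma-r}{1+r}}.$$

Substituting this $\varepsilon$ into (2.34) and (2.35) and combining the obtained
inequalities with (2.26), we finish the proof.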
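The choice of $\varepsilon$ is the positive solution of a quadratic equation in $x = \gamma\varepsilon$, namely $(1+r)x^2 + 2x - (\gamma - r) = 0$, which is positive only when $\gamma > r$. The following sketch checks the closed form numerically; the function names and the sample values of $\gamma$ and $r$ are illustrative assumptions, not taken from the paper.

```python
import math


def epsilon_root(gamma: float, r: float) -> float:
    """Positive root eps of gamma - r - gamma*eps - (1+r)*(gamma*eps)**2 = gamma*eps."""
    if gamma <= r:
        raise ValueError("a positive root exists only when gamma > r")
    inner = 1.0 / (1.0 + r) ** 2 + (gamma - r) / (1.0 + r)
    return (-1.0 / (1.0 + r) + math.sqrt(inner)) / gamma


def residual(gamma: float, r: float, eps: float) -> float:
    """Left-hand side minus right-hand side of the defining equation."""
    x = gamma * eps
    return gamma - r - x - (1.0 + r) * x * x - x


if __name__ == "__main__":
    for gamma, r in [(0.5, 0.1), (1.0, 0.25), (4.0, 1.0)]:
        eps = epsilon_root(gamma, r)
        print(f"gamma={gamma}, r={r}: eps={eps:.6f}, "
              f"residual={residual(gamma, r, eps):.2e}")
```

At this root the bracket in the denominator of (2.35) equals $\gamma\varepsilon$ by construction, so both bounds are then expressed through powers of $\gamma\varepsilon$ alone.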



2.5. Proof of Theorem 2. The first step in the proof of this theorem is 
to show that the data-driven parameter $\hat\alpha$ defined by (1.5) cannot be very
small, or equivalently, that the ratio $D(\hat\alpha)/D(\alpha)$ is not large.

Lemma 8. For any given $\alpha^\circ \le \alpha$ and $\gamma > 0$ the following upper bound
holds



(2.36) <^ E 



D(a) 
D(a) 



1+7/4 1 1/(1+7/4) 



< CR a °(9) j OBr — 1/2 CRg°{9) _|_ C 



a 2r yD(a 



a 2 D(a) 7 4 ' 



Proof. According to the definition of the empirical risk minimization,
for any given $\alpha^\circ$, $R_{\hat\alpha}[Y,\mathrm{Pen}] \le R_{\alpha^\circ}[Y,\mathrm{Pen}]$. One can check with a little
algebra that this inequality is equivalent to [see (1.8)]



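$$\begin{aligned}
&\sum_{k=1}^{n}[1-h_{\hat\alpha}(k)]^2\langle\theta,\psi_k\rangle^2 + \sigma^2\sum_{k=1}^{n}\lambda^{-1}(k)h_{\hat\alpha}^2(k)\\
&\qquad - \sigma^2\sum_{k=1}^{n}\lambda^{-1}(k)\tilde h_{\hat\alpha}(k)[\xi^2(k)-1] + (1+\gamma)\sigma^2Q^\circ(\hat\alpha)\\
&\qquad + 2\sigma\sum_{k=1}^{n}\lambda^{-1/2}(k)[1-h_{\hat\alpha}(k)]^2\xi(k)\theta(k)\\
&\quad\le \sum_{k=1}^{n}[1-h_{\alpha^\circ}(k)]^2\langle\theta,\psi_k\rangle^2 + \sigma^2\sum_{k=1}^{n}\lambda^{-1}(k)h_{\alpha^\circ}^2(k)\\
&\qquad - \sigma^2\sum_{k=1}^{n}\lambda^{-1}(k)\tilde h_{\alpha^\circ}(k)[\xi^2(k)-1] + (1+\gamma)\sigma^2Q^\circ(\alpha^\circ)\\
&\qquad + 2\sigma\sum_{k=1}^{n}\lambda^{-1/2}(k)[1-h_{\alpha^\circ}(k)]^2\xi(k)\theta(k),
\end{aligned}\tag{2.37}$$

where $\tilde h_\alpha(k) = 2h_\alpha(k) - h_\alpha^2(k)$. Next, representing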



$$(1+\gamma)Q^\circ(\hat\alpha) = \Bigl(1+\frac{\gamma}{2}\Bigr)Q^\circ(\hat\alpha) + \frac{\gamma}{2}Q^\circ(\hat\alpha),$$

we obtain from (2.37)



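$$\begin{aligned}
\frac{\gamma\sigma^2}{2}Q^\circ(\hat\alpha) \le{}& R_{\alpha^\circ}(\theta) - \sigma^2\sum_{k=1}^{n}\lambda^{-1}(k)\tilde h_{\alpha^\circ}(k)[\xi^2(k)-1]\\
&+ \sigma^2\sup_{\alpha}\Bigl\{\sum_{k=1}^{n}\lambda^{-1}(k)\tilde h_\alpha(k)[\xi^2(k)-1] - \Bigl(1+\frac{\gamma}{2}\Bigr)Q^\circ(\alpha)\Bigr\}\\
&+ 2\sigma\sum_{k=1}^{n}\lambda^{-1/2}(k)[\tilde h_{\hat\alpha}(k)-\tilde h_{\alpha^\circ}(k)]\xi(k)\theta(k) - L_{\hat\alpha}(\theta).
\end{aligned}\tag{2.38}$$

Since $\alpha^\circ$ is fixed, we get by Jensen's inequality

$$\mathbf{E}\Bigl|\sum_{k=1}^{n}\lambda^{-1}(k)\tilde h_{\alpha^\circ}(k)[\xi^2(k)-1]\Bigr|^{1+\gamma/4} \le C\Bigl[\sum_{k=1}^{n}\lambda^{-2}(k)\tilde h_{\alpha^\circ}^2(k)\Bigr]^{1/2+\gamma/8} = C[D(\alpha^\circ)]^{1+\gamma/4}. \tag{2.39}$$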



Next, by Theorem 1, 



Esup 



(2.40) 



f^^Hkyhameik) - 1] - (i + |)q» 



< 



,fc=i 

C£) 1+ T/ 4 (a 

^y3 



1+7/4 



The upper bound for the last line in (2.38) is a little more delicate.
Noticing that $\tilde h_\alpha(\cdot)$ is a family of ordered smoothers, we get by Lemma 4
that, for any $\varepsilon > 0$ and given $p \in (1, 2)$,



E 



2a^2\- 1 / 2 (k)[h & (k) - h a °(k)]£(k)6(k) 



k=i 



(2.41) 



< 



4a 2 ^A- 1 (A:)[/ iA (A ; )-V(A ; )] 2 2 (A : ) 
fe=i 

C(p) 



p/2 



1+7/4 



e (l+7/4)/(p-l) ' 

To continue this inequality, note that if $\hat\alpha > \alpha^\circ$, then



h a o (k) 



<1, 



h&(k) 
h a o (k) 



> h a (k) 



and therefore 



^[MAO-M*)] 2 * -1 ^)^*) 



fc=l 



(2.42) 



k=l 



h & (k) 



\- 2 {k)e 2 {k) 



h a o (k)_ 

oo 

< maxA^(fc)^ (fc) ^[1 _ h & (k)] 2 e 2 {k) 



k=l
oo

< 4maxA- 1 (A;)/i2 (A;) - h & (k)] 2 9 2 (k). 
k k=l 

Analogously, if $\hat\alpha < \alpha^\circ$, then

n 

J2[hao(k)-h & (k)] 2 X^(k)9 2 (k) 



k=i 



(2.43) 



< max\-\k)hl(k) - h a o(k)] 2 6 2 (k) 



k=l 



< 4maxA- 1 (fc)/iI(A:) - V (k)] 2 9 2 (k). 



k=l 

Next, combining (2.41)-(2.43) with Young's inequality,

$$yx^{q} - x \le y^{1/(1-q)}\bigl[q^{q/(1-q)} - q^{1/(1-q)}\bigr],\qquad x, y > 0,\ q < 1, \tag{2.44}$$

gives

1+7/4 

2aJ2^ 1/2 (k)[h & (k) -~ha°(kMk)6(k) -L & {9) 



E 



fc=i 



< CB 



2a^\-V 2 (k)[h & (k)-~h a °(kM(k)0(k) 



fe=i 



Aa 2 J2\- 1 (k)[h & (k)-h a o(k)} 2 9 2 (k) 



k=l 



p/2 



1+7/4 



27 



+ CE 



< 



C 



Aa 2 Y,^\k)[K{k)-K°(k)] 2 9 2 (k) 
C 



k=l 



p/2 



1+7/4 



+ 



(l+ 7 /4)/(p-l) p2(l+ 7 /4)/(p-2) 



+ 



C 



£ 



2(l+ 7 /4)/(p-2) 



L d (0) 

p(l+ 7 /4)/(2-p) 
p(l+7/4)/(2-p) 



a max A (k)h a c(k) 

k 



.k=l 
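Young's inequality (2.44) follows from maximizing $x \mapsto yx^q - x$ over $x > 0$: the maximum is attained at $x^* = (qy)^{1/(1-q)}$ and equals the right-hand side of (2.44). A minimal numerical check is sketched below; it assumes $0 < q < 1$, and the function names and sampling ranges are illustrative choices rather than anything from the paper.

```python
import random


def young_gap(x: float, y: float, q: float) -> float:
    """Right-hand side minus left-hand side of (2.44); should be nonnegative."""
    lhs = y * x ** q - x
    rhs = y ** (1.0 / (1.0 - q)) * (q ** (q / (1.0 - q)) - q ** (1.0 / (1.0 - q)))
    return rhs - lhs


def maximizer(y: float, q: float) -> float:
    """Point x* = (q*y)**(1/(1-q)) at which y*x**q - x is maximal."""
    return (q * y) ** (1.0 / (1.0 - q))


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [
        (rng.uniform(1e-3, 50.0), rng.uniform(0.1, 3.0), rng.uniform(0.1, 0.8))
        for _ in range(10000)
    ]
    print("smallest gap:", min(young_gap(x, y, q) for x, y, q in samples))
    print("largest gap at x*:",
          max(abs(young_gap(maximizer(y, q), y, q)) for _, y, q in samples))
```

The second printed value being at rounding level shows that the constant in (2.44) cannot be improved.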



Therefore, minimizing in $\varepsilon > 0$ the right-hand side of the bound just obtained from (2.41)-(2.44),
we get

1+7/4 

2a ^ \~ l (k)[h & (k) - h a o{k)}i(k)9(k) - L & (9) 



E 



k=l
< CI ]T[1 - h a o(k)] 2 e 2 {k) + a 2 max X^{k)h 2 a0 (k)



1+7/4 



This equation and (2.38)-(2.40) imply 

7 l+7/4 E[(J 2go ((i)] l +7 /4 < C^t 7 ^) + 

and by (2.20) we get 



C[a 2 D(a)} 1+ ^ 4 



(2.45) 7 1+ t/ 4 E 



D(d) 



_£>(a) 
It is easily seen that 



log 1 / 2 



D(a) 



1+7/4 



<C 



a 2 D(a) 



1+7/4 



C 
+ —■ 



E 



(2.46) 



D(d) 
£>(a) 



lDK 1 / 2 



D(d) 
£>(a) 



1+7/4 



(l+ 7 /4) 1 /2+7/8 



E 



D(a) 
D(a) 



1+7/4 



log 



D(a) 



1+7/4 



1/2+7/8 



To finish the proof, let us consider the function $f(x) = x\log^{1/2+\gamma/8}(x)$,
$x \ge 1$. Computing its second-order derivative, one can easily check that $f(x)$
is convex for all $x \ge \exp(1) = e$. So, $f(x + e - 1)$ is convex for $x \ge 1$. Note
also that there exists a constant $C > 0$ such that for all $x \ge 1$,

$$f(x) \ge \tfrac{1}{2}f(x+e-1) - C.$$

Therefore according to (2.46) and Jensen's inequality, 

1+7/4 

E' 



D(a) 


log 


.1/2 D(a) 


D(a) 


D(a) 


>c 


E 


(D(a)\ 









x <^ lo. 



E 



1+7/4 

+ e-l 
D(a^ 1+7/4 



D(a) 



+ e-l 



1 1/2+7/8 



1 



C. 



Finally, substituting this inequality into (2.45) and inverting f(x), we 
arrive at (2.36). □ 
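The convexity claim used at the end of this proof can be verified directly. Writing $a = 1/2 + \gamma/8$, differentiation of $f(x) = x\log^{a}(x)$ gives $f''(x) = (a/x)\log^{a-2}(x)\,[\log(x) + a - 1]$, which is positive as soon as $\log(x) \ge 1$. The sketch below compares this closed form with a central finite difference; the function names, step size and test points are illustrative assumptions.

```python
import math


def f(x: float, gamma: float) -> float:
    """f(x) = x * log(x)**(1/2 + gamma/8), defined for x >= 1."""
    a = 0.5 + gamma / 8.0
    return x * math.log(x) ** a


def f_second(x: float, gamma: float) -> float:
    """Closed form f''(x) = (a/x) * log(x)**(a-2) * (log(x) + a - 1), for x > 1."""
    a = 0.5 + gamma / 8.0
    log_x = math.log(x)
    return (a / x) * log_x ** (a - 2.0) * (log_x + a - 1.0)


def f_second_numeric(x: float, gamma: float, rel_step: float = 1e-3) -> float:
    """Central finite-difference approximation of f''(x) with step rel_step * x."""
    h = rel_step * x
    return (f(x + h, gamma) - 2.0 * f(x, gamma) + f(x - h, gamma)) / (h * h)


if __name__ == "__main__":
    for gamma in (0.1, 1.0, 4.0):
        for x in (math.e, 5.0, 50.0, 1000.0):
            print(gamma, x, f_second(x, gamma), f_second_numeric(x, gamma))
```

Since $\log(x) + a - 1 \ge a > 0$ for $x \ge e$, the shifted function $f(x + e - 1)$ is convex on $x \ge 1$, which is what Jensen's inequality requires.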



The next lemma controls the cross term in the empirical risk.

Lemma 9. Let $h^\varepsilon_\alpha(k) = [(1+2\varepsilon)h_\alpha(k) - \varepsilon h_\alpha^2(k)]/(1+\varepsilon)$. Then for any
given $\varepsilon > 0$ and $\alpha^\circ \in (0, \alpha]$,



2aB 



J2[l- h%(k)]9(k)X-y 2 (k)ak) 



k=l 



(2.47) 



< 



CR a o(9)^ „ 1/2 CR a o(6) t Ca 2 D{a 



1 



a 2 D(a) 



+ 



1/2 



e]T[i - h & {k)fe 2 {k) + Yfi - Ko(k)] 2 e 2 (k) 



k=l 



k=l 



1/2 



Proof. Since $h^\varepsilon_\alpha(k)$ is a family of ordered smoothers, combining Lemma 5
with the obvious inequalities $\max_k \lambda^{-1}(k)[h^\varepsilon_\alpha(k)]^2 \le D(\alpha)$ and $h^\varepsilon_\alpha(k) \ge h_\alpha(k)$,
we obtain



2aE 



£[i-/4(A0Mfc)A- 1/2 (^(fc) 



fc=l 



2aE 



(2.48) 



<CC7 



E[^^)-^(fc^WA- 1/2 (A:)e(A:) 

1/2 



k=l 



VD{a)Y^-K°(k)] 2 6 2 {k) 



k=l 



+ Ca 



D(a°)EY,[l-ha(k)] 2 e 2 (k) 



k=l 



1/2 



Next, according to (2.20), Q°(a) > D(a)y/log[D(a)/D(a)], and we get 

D(a )<Ca- 2 R a .(e)log^ 2 ^^- y 

a A u\a) 

Substituting this inequality and (2.36) in (2.48), we obtain (2.47). □ 

We are now in a position to prove Theorem 2. Let $\varepsilon \in (0, 1]$ be a given
number to be defined later on. According to (1.8) and (1.9), we obtain the
following equation for the skewed excess risk:

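$$\begin{aligned}
S(\varepsilon) ={}& \sup_{\theta\in\mathbb{R}^n}\mathbf{E}\bigl\{\|\theta-\hat\theta_{\hat\alpha}\|^2 - (1+\varepsilon)\{R_{\hat\alpha}[Y,\mathrm{Pen}] + C\}\bigr\}\\
={}& \sup_{\theta\in\mathbb{R}^n}\mathbf{E}\Bigl\{-\varepsilon\sum_{k=1}^{n}[1-h_{\hat\alpha}(k)]^2\theta^2(k) - \varepsilon\sigma^2\sum_{k=1}^{n}\lambda^{-1}(k)h_{\hat\alpha}^2(k)\\
&\quad - (1+\varepsilon)(1+\gamma)\sigma^2Q^\circ(\hat\alpha)\\
&\quad - 2\sigma\sum_{k=1}^{n}\{1+\varepsilon-[(1+2\varepsilon)h_{\hat\alpha}(k)-\varepsilon h_{\hat\alpha}^2(k)]\}\,\theta(k)\lambda^{-1/2}(k)\xi(k)\\
&\quad + \sigma^2\sum_{k=1}^{n}\lambda^{-1}(k)[2(1+\varepsilon)h_{\hat\alpha}(k)-\varepsilon h_{\hat\alpha}^2(k)][\xi^2(k)-1]\Bigr\}.
\end{aligned}\tag{2.49}$$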



To control the last line on the right-hand side of this equation, we use that
$h^\varepsilon_\alpha(k) = [2(1+\varepsilon)h_\alpha(k) - \varepsilon h_\alpha^2(k)]/(2+\varepsilon)$ is a family of ordered smoothers.
Hence, Lemmas 3, 6 and 8 imply



a 2 E^ A^ 1 (fc)[2(l + e)h & (k) - ehl(k)][f(k) - 1] 



(2.50) 



fc=i 



< c R a o(6) 1/2 C^ a o(g) + Ca 2 D(a) 



7 



<7 2 7_D(a) 



Next, substituting (2.50) and (2.47) into (2.49), we obtain the following 
upper bound for the skewed excess risk: 



S{e) <eR a o{6) + 



C 



^(9),„„-i/j^(8) , <? 2 D(a) 



7 



log~ 



a 2 D(a) 



+ 



Finally, substituting this upper bound into

$$\mathbf{E}\|\theta - \hat\theta_{\hat\alpha}\|^2 \le (1+\varepsilon)R_{\alpha^\circ}(\theta) + S(\varepsilon)$$

and minimizing the obtained inequality in $\varepsilon$, we get



<r(0) + O(0)inN e + 



E||r? — 0q || 

<r(0)jl + 
thus finishing the proof. 



e>0 



1 



1 



a 2 D(d) 



— log 7 



+ 



a 2 L>(a) 7 4 r(fl) 



+ 



1/2 



7 4 r(#)